A balanced communication-avoiding support vector machine decision tree method for smart intrusion detection systems

The growth of the Internet of Things has created many challenges for network architectures. Ensuring cyberspace security is the primary goal of intrusion detection systems (IDSs). As the number and types of attacks increase, researchers have sought to improve IDSs so that they efficiently protect the data and devices connected in cyberspace. IDS performance depends essentially on the amount of data, the data dimensionality, and the security features. This paper proposes a novel IDS model that reduces computational complexity, providing accurate detection in less processing time than related works. The Gini index method is used to compute the impurity of the security features and refine the selection process. A balanced communication-avoiding support vector machine decision tree method is applied to enhance intrusion detection accuracy. The evaluation is conducted using the UNSW-NB 15 dataset, a real and publicly available dataset. The proposed model achieves high attack detection performance, with an accuracy of approximately 98.5%.


Scientific Reports | (2023) 13:9083 | https://doi.org/10.1038/s41598-023-36304-z

In light of this introduction, this paper seeks to provide a more accurate intrusion detection system based on the Balanced Communication-Avoiding Support Vector Machine Decision Tree (BCA-SVMDT) method. The aim is to reduce computational complexity by providing accurate detection in less processing time than related works. The goals are as follows:
1. Model an intrusion detection system based on BCA-SVMDT to efficiently detect cyberspace attacks.
2. Verify the performance of the proposed model according to the accuracy, precision, recall, and F-score.
3. Compare the proposed model with intrusion detection systems based on traditional machine learning methods.
The remainder of this paper is organized as follows. Related works are cited and discussed in section two. Section three describes the proposed intrusion detection system, which is built on the BCA-SVM and DT methods. Experiments and findings are presented in section four. Finally, the conclusion and future work are presented in the last section.

Related works
Intrusion detection systems seek to avoid network attacks. These attacks can be categorized into four essential types: (1) The attacker overloads resources (memory, network interface, services, etc.). This type of attack is named the Denial of Service (DoS) attack. (2) The attacker attempts to use the system as a normal user. This type of attack is called the Remote-to-Local (R2L) attack. (3) The attacker logs into the system as a normal user and then attempts to obtain administrator privileges. This type of attack is named the User-to-Root (U2R) attack. (4) The attacker scans the network traffic to find useful information for remotely accessing computers. This type of attack is called the probe attack.
In this section, we focus on SVM-based IDS methods proposed in the literature. Wang et al. 8 attempted to detect intrusions using a smaller dataset derived from the primary training data. The authors performed three steps to detect intrusions: (1) extract the detection models from the dataset, (2) analyze the training audit data, and (3) detect network anomalies. The first step relied on the exemplar extraction method. The second step used affinity propagation and K-means clustering. The third step applied Principal Component Analysis (PCA), a k-NN, and an SVM to detect abnormal network behavior. The Knowledge Discovery and Data Mining Tools Competition (KDD Cup) dataset and real HyperText Transfer Protocol (HTTP) traffic were employed to evaluate their intrusion detection system.
He et al. 9 attempted to accelerate detection by using the twin SVM method, which requires less training time than the SVM. The proposed IDS is composed of twin SVM and Radial Basis Function (RBF) kernels. Unfortunately, this method requires considerable prediction time. The authors evaluated their IDS on R2L and U2R attacks using the KDD Cup dataset. Lin et al. 10 combined SVM and decision tree classifiers to find significant features related to attack behaviors. The proposed method sought to select decision rules using the KDD Cup dataset and detect predicted attacks. Shang et al. 11 combined the SVM classifier and the Particle Swarm Optimization (PSO) method. The authors aimed to detect anomalies using one class of samples trained by the PSO method. The evaluation was performed on real network traffic data, and the comparisons were limited. Khreich et al. 12 focused on system calls and traces. The authors combined frequency and temporal information for use by the SVM in the training phase. Their IDS was evaluated on the Australian Defence Force Academy Linux Dataset (ADFA-LD).
Cid-fuentes et al. 13 used SVM and decision tree classifiers to improve the accuracy of an IDS. Teng et al. 14 built their model on 2-class SVM and decision tree methods. The authors aimed to decrease the overhead and enhance the attack detection rate. Hu et al. 15 combined the SVM with Adaboost classifiers. The authors used Adaboost because it was an iterative method. Adaboost enhanced the classification performance by learning from the mistakes and weaknesses of classifiers. Hu et al. provided global detection in each node by using Adaboost twice. The first use selected the decision stumps, and the second use improved the online Adaboost.
Aburomman et al. 16 sought to increase the accuracy of an IDS using a k-NN classifier. Their proposed system used six SVM and six k-NN models in the training phase. The authors performed the PSO and Weighted Majority Algorithm (WMA) methods for the decision phase. Wu et al. 17 presented an IDS based on deep belief networks and a weighted SVM. The performance of the deep belief network is enhanced by the learning rate method. Then, the SVM is trained using the PSO method. The results lead to an efficient weighted SVM.
Anil et al. 18 introduced an IDS using the Genetic Algorithm (GA) and entropy function. This method provides a high ability to extract features from the KDD Cup dataset. The authors applied a Self-Organized Feature Map (SOFM) with the SVM to find the similarity between groups in the dataset. The authors showed that their approach achieved a high detection rate with low computation time. Yi et al. 19 proposed an incremental SVM method to decrease the noise that appeared due to feature differences. A modified kernel function based on the Gaussian function is used with the SVM during the training phase.
Chitrakar et al. 20 introduced an approach based on an SVM with the half-partition method. The incremental feature of the SVM and the concentric-ring method allowed real-time detection of intrusions. Thaseen et al. 21 presented a method based on multiclass SVM classifiers to detect intrusions. The purpose is to identify several classes according to the network traffic. The authors employed chi-squared filtering together with the multiclass SVM to enhance the feature selection step. Experimentation was performed using the NSL-KDD dataset and the Libsvm library in the MATLAB environment. The results demonstrated the effectiveness of the proposed method in terms of accuracy and time costs. Kuang et al. 22 introduced an IDS model based on the multilayer SVM approach. The model comprises four SVM classifiers and an Improved Chaotic Particle Swarm Optimization (ICPSO) method. The authors sought to detect the four essential types of attacks (R2L, DoS, U2R, and probe). The presented IDS scheme is enhanced by using Principal Component Analysis (PCA) with an SVM to reduce the training time. The experimentation was conducted in the MATLAB environment using the KDD Cup dataset. The findings showed that the method improved the detection accuracy and reduced the processing time in the training and testing phases.
Jaber et al. 23 sought to model an IDS using the clustering process. The authors combined the SVM classifier and the Fuzzy C-Means (FCM) clustering method to achieve more accurate intrusion detection in cloud computing. They conducted experiments using Weka simulation with the NSL-KDD dataset. Safaldin et al. 24 proposed an IDS scheme using the binary Gray Wolf Optimizer (GWO) as a meta-heuristic method with the SVM. The GWO algorithm is used to enhance parameters during SVM training. The verification of the proposed model is performed using the NSL-KDD '99 dataset.
Cheng et al. 25 combined the SVM classifier with the bat algorithm to design an IDS model. The bat algorithm is employed in the training phase to find the optimum parameters of the SVM. The KDD Cup '99 dataset is used in the simulation experiments. Raman et al. 26 developed an IDS model based on an SVM and a genetic algorithm. A method called the Hypergraph-based Genetic Algorithm (HG-GA) is applied in the selection step to identify the optimum parameters for the SVM classifier. The HG-GA provided the optimal solution and avoided becoming trapped in local minima. The IDS based on the HG-GA SVM is simulated using the NSL-KDD dataset.
Kalita et al. 27 attempted to handle intrusions using an SVM and Particle Swarm Optimization (PSO). The IDS model based on the SVM classifier achieved higher accuracy when the selected parameters were well chosen. The authors applied a variant of PSO and a multi-PSO algorithm in the selection step to ensure better performance. Li et al. 28 proposed an IDS model based on the Artificial Bee Colony (ABC) algorithm for feature selection and the SVM classifier. The ABC method is enhanced using honey source coding and the neighborhood search method to retrieve the optimum parameters for the SVM.
Mehmod et al. 29 sought to improve the selection method before using an SVM classifier to identify attacks. The authors focused on useful features by avoiding noise and redundancy. The selection method is performed by applying the ant colony optimization algorithm on the KDD Cup '99 dataset. Acharya et al. 30 adopted a general approach-based SVM to design an IDS. Regarding the selection step, the authors proposed an intelligent water drop (IWD) algorithm to select the relevant features for classification. The KDD Cup '99 dataset is used to evaluate the proposed IDS.
Li et al. 31 stated that the Velocity Adaptive Shuffled Frog Leaping Bat Algorithm (VASFLBA) was an effective method for the selection process. The procedure is based on two adaptive factors to balance global and local search. The Shuffled Frog Leaping Algorithm (SFLA) improved the transfer mechanism. The selected features were used to train SVM classifiers on the Industrial Control System (ICS) dataset. Bostani et al. 32 designed an IDS based on hybrid feature selection. A Binary Gravitational Search Algorithm (BGSA) and Mutual Information (MI) were used to perform the selection step. Experimentation was conducted using the NSL-KDD dataset.
Kabir et al. 33 introduced the Least Squares Support Vector Machine (LS-SVM) to build an accurate IDS. An optimum allocation algorithm is used to select representative samples. The IDS is tested using the KDD Cup '99 dataset. Saleh et al. 34
In light of this brief description of the related works, IDSs still face the following five challenges 37,38:
(1) Large dataset challenge. A large amount of data in a dataset leads to highly time-consuming training steps. Exemplar extraction methods and clustering methods have been proposed to reduce the dataset size without losing relevant information.
(2) Normalization challenge. The quality of the data directly influences the accuracy of intrusion detection systems. The normalization method rebuilds data to obtain valuable data and reduces the processing time. Selecting the best normalization method is a crucial step for an IDS.
(5) Online learning challenge. As an SVM does not support periodic retraining, the classifier cannot manage the requests of an online intrusion detection system. Some attempts use an online SVM to support online learning demands.
In this paper, the proposed IDS seeks to address the above challenges. The model comprises a selection method and a hybrid classifier based on the Balanced Communication-Avoiding Support Vector Machine Decision Tree (BCA-SVMDT) method. The selection method aims to select the most significant features for training. The BCA-SVMDT, which is discussed in the next section, carries out the training phase.

BCA-SVMDT-based intrusion detection system
The proposed model is introduced in this section. The IDS model is composed of three main modules, as shown in Fig. 1. The intrusion model is built based on a decision tree; and on a particular node, the BCA-SVM classifier is used. The illustrated model in Fig. 1 is detailed in the following sections.
Preprocessing module. Data exploration. This step focuses on the quality of the data. To ensure the accuracy of the prediction model, data exploration inspects the data to explore its features. The type of data (numerical or categorical) is verified to determine a suitable statistical or prediction model. In our case, the UNSW-NB 15 dataset is used 39 . This dataset is available online and is composed of 175,341 records. The UNSW-NB 15 dataset encompasses 44 features, including normal and attack status. The data exploration process identifies three nominal features (proto, state, and service). The other features are defined by numerical values (binary, integer, and floating-point). The nominal features are passed to the next step (security feature encoding), where they are transformed from nominal values into numerical values.
Security feature encoding. This step encodes the nominal values identified by the data exploration step. Nominal features (proto, state, and service) are encoded using the label encoding method. Unlike the one-hot encoding method, label encoding does not create additional features. This is why the label encoding method is chosen to transform these three features from nominal values into numerical values. The method labels the same parameter with the same numerical value. The example illustrated in Fig. 2 describes the label encoding method. The security feature encoding step is performed using the LabelEncoder class of the sklearn library in Python.
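As a minimal sketch of what label encoding does, the following pure-Python helper maps each distinct nominal value to an integer code, assigning codes in sorted order of the values (the same ordering used by sklearn's LabelEncoder). The helper name `label_encode` and the sample `proto` values are illustrative assumptions and are not taken from the paper's code.

```python
def label_encode(values):
    """Map each distinct nominal value to an integer code.

    Codes follow the sorted order of the distinct values, so the same
    value always receives the same code and no new columns are created.
    """
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Illustrative sample of the nominal `proto` feature (assumed values).
proto = ["tcp", "udp", "tcp", "arp", "udp"]
codes, mapping = label_encode(proto)
print(codes)    # [1, 2, 1, 0, 2]
print(mapping)  # {'arp': 0, 'tcp': 1, 'udp': 2}
```

The column keeps its original width, which is the property that motivates choosing label encoding over one-hot encoding here.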
Security feature normalization. This step manages data with different scales. It aims to rescale the values of all features to a zero mean and a unit variance. The normalization process is fundamental in the training phase to provide an accurate classification model. The rescaled value is computed through the following equation.

$$D_S = \frac{D_i - \bar{D}}{\sigma} \quad (1)$$

$D_S$ is the scaled value, $D_i$ is the original value, $\bar{D}$ is the mean value of the feature, and the standard deviation is represented by $\sigma$. Normalization is performed for every feature that has a different distribution using the sklearn library in Python.
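The rescaling to zero mean and unit variance can be sketched for a single feature column as follows. This is an illustration under stated assumptions: the function name `standardize` and the sample values are hypothetical, and the population standard deviation is used, which is also what sklearn's StandardScaler does. The paper does not name the sklearn class it used.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mean = fmean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant feature has no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(d_i - mean) / sigma for d_i in column]


# Illustrative values for a numerical feature (assumed, not from the dataset).
feature = [200.0, 400.0, 600.0, 800.0]
scaled = standardize(feature)
print(scaled)
```

After the transformation, every feature contributes on a comparable scale, which matters for distance-based kernels such as the RBF kernel used later.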
Ranking and selection. This step aims to select significant features that support the decision-making process. The Gini index method is applied to rank the features. Here it is applied to binary (attack and benign) data, although the Gini index works better on multiclass data 40 . The Gini index method is performed as follows: (1) it detects the impurity of the features; (2) it ranks the features based on the Gini impurity, which is defined by the entropy; and (3) it builds the decision tree. The Gini index is computed in every node using Eq. (2).

$$Gini(n) = 1 - \sum_{i=1}^{T} P_i^{2} \quad (2)$$

where $n$ is a node, $T$ is the number of all nodes, and $P_i$ is the probability of a tuple.
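To make the impurity computation concrete, the sketch below evaluates the standard Gini impurity, one minus the sum of squared class proportions, for the records that reach a node. The function name and the toy label lists are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions in a node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


pure_node = ["attack"] * 8
mixed_node = ["attack"] * 4 + ["normal"] * 4
print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

For two classes the impurity ranges from 0 (all records in one class) to 0.5 (an even split), so a feature whose split produces low-impurity children ranks higher.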
The Gini index is applied for all features in the UNSW-NB 15 dataset. Table 1 illustrates the ranking associated with the security features.
The important security features are selected according to a threshold (threshold = 0.023), which is defined through the tree model. The threshold value can be changed according to the dataset used. The number of features is reduced from 42 to 15. Figure 3 shows the selected features and their scores. As mentioned above, this step helps to reduce the computational complexity and increase the accuracy of the proposed decision tree-BCA-SVM classification.
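The threshold-based selection can be sketched as a simple filter over the ranking scores. Only the threshold value 0.023 and the feature name sttl come from the text; the other feature names and all scores below are hypothetical placeholders, and keeping scores equal to the threshold is an assumption.

```python
THRESHOLD = 0.023  # value reported in the text


def select_features(scores, threshold=THRESHOLD):
    """Keep features whose score meets the threshold, highest score first."""
    kept = [(name, s) for name, s in scores.items() if s >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Hypothetical scores for illustration only; not the values from Table 1.
example_scores = {
    "sttl": 0.40,
    "feature_b": 0.05,
    "feature_c": 0.023,
    "feature_d": 0.01,
}
selected = select_features(example_scores)
print(selected)  # ['sttl', 'feature_b', 'feature_c']
```

Raising the threshold keeps fewer features and shortens training, at the risk of discarding informative ones, which is why the text notes that the value depends on the dataset.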
Training module. The training module is performed based on hybrid BCA-SVM and decision tree methods.
The BCA-SVM classifier is an optimized version of the SVM and achieves better classification results. Figure 4 illustrates the BCA-SVMDT intrusion detection tree.
The sttl feature chosen by the Gini index method is considered the root node. Branches are added based on the feature name, Gini index, samples, value, closeness measure (c), and class name. This module is performed in the local learning model according to the following steps: (1) Select the SVM kernel function (radial basis function) with the regularization parameter C and the kernel parameter σ. These parameters are chosen according to the validation results. The BCA-SVM learning step is summarized in Fig. 5. In the next section, we detail the experiments and the evaluation of the proposed BCA-SVMDT model.
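For reference, the radial basis function kernel mentioned in step (1) can be sketched as below. The text does not give the exact kernel expression, so the standard Gaussian form exp(-||x - y||^2 / (2σ^2)) is assumed; the function names are hypothetical, and the BCA-SVM optimization itself is not reproduced here.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def sigma_to_gamma(sigma):
    """Convert sigma to the gamma parameterisation: gamma = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)


a = [0.0, 1.0]
b = [1.0, 0.0]
print(rbf_kernel(a, a, sigma=1.0))  # 1.0 for identical points
print(rbf_kernel(a, b, sigma=1.0))  # exp(-1), about 0.3679
```

A small σ makes the kernel decay quickly with distance and gives a more flexible decision boundary, while C controls how strongly training errors are penalized, which is why both are tuned on validation results.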

Experiments
In this section, the proposed BCA-SVMDT intrusion detection system is evaluated using the UNSW-NB 15 dataset. This dataset was created by the Cyber Range Lab of the Australian Center for Cyber Security (ACCS) 37 . As mentioned in section "BCA-SVMDT-based intrusion detection system", the dataset is composed of 42 features. In our research, only the 15 most significant features are used.
The training phase aims to build two classes: normal or attack. The nature of the attack is outside the scope of this research. For training, the proposed model used 120,890 records. For the test phase, 16,607 records were used. The experimentation was conducted in Python 3.8 running on a computer with a Core i7 CPU and 8 GB of RAM.
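The evaluation is performed using four metrics: the accuracy, precision, recall, and F-score. These metrics are important for comparing the proposed IDS with some traditional Machine Learning (ML) models. The evaluation metrics are computed based on the following values:
• TP (True Positives) denotes the number of correctly detected intrusions.
• TN (True Negatives) denotes the number of normal records correctly identified as normal.
• FP (False Positives) denotes the number of normal records incorrectly flagged as intrusions.
• FN (False Negatives) denotes the number of intrusions that are not detected.
The accuracy is the proportion of all records that are classified correctly, as given by Eq. (3).

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (3)$$

The precision is the number of correct detections divided by all records flagged as intrusions, as given by Eq. (4).

$$Precision = \frac{TP}{TP + FP} \quad (4)$$

The recall represents the number of correct detections divided by all intrusion cases in the dataset. Equation (5) shows the recall formula.

$$Recall = \frac{TP}{TP + FN} \quad (5)$$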
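The four metrics can be computed directly from the confusion counts, as in the following sketch. The function name and the counts passed to it are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical confusion counts for illustration; not results from the paper.
metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.85, the precision 0.90, the recall about 0.818, and the F-score about 0.857, which shows how a model can be precise while still missing a share of the intrusions.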
The F-score metric balances the precision and recall. It is described by Eq. (6).

$$F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (6)$$

The proposed model is also evaluated according to the Receiver Operating Characteristic (ROC) curve. The ROC curve gives an idea about the performance of the BCA-SVMDT model and the separation between the two classes: normal and attack. The ROC curve is defined by Eq. (7) and is shown in Fig. 6. In Fig. 6, the Area Under the Curve (AUC) is approximately 0.98, which indicates an accurate prediction model.
Traditional models based on machine learning methods such as the SVM, k-Nearest Neighbors (k-NN), Logistic Regression (LR), and Naïve Bayes (NB) are applied to the same dataset to assess the benefits of the proposed model in depth. Figure 7 illustrates the comparison between the proposed BCA-SVMDT and the other ML methods according to the accuracy, precision, recall, and F-score metrics. The results show that the BCA-SVMDT method for intrusion detection achieves the best performance among the compared methods.
The proposed IDS model decreases the computational complexity by using the ranked security features in the selection approach. Therefore, the processing time and overfitting are reduced.
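As an illustration of the ROC analysis described above, the area under the ROC curve can be computed without plotting by using its rank interpretation: the AUC equals the probability that a randomly chosen attack record receives a higher score than a randomly chosen normal record. The function name, labels, and scores below are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random attack outscores a random normal record.

    labels: 1 for attack, 0 for normal. Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and classifier scores for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

An AUC of 1.0 means the scores separate the two classes perfectly, and 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of records, so a sort-based variant would be preferable on a test set of the size used here.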

Conclusion
Protecting networks from intrusions and attacks is a great challenge for cyberspace. In this paper, an accurate IDS based on a hybrid approach is presented. A novel intelligent system called BCA-SVMDT, composed of a decision tree and a balanced communication-avoiding support vector machine classifier, is proposed to optimize the training phase. In the preprocessing module, the data are encoded and rescaled. The Gini index method is used to compute the impurity of the security features. Our model reached a high accuracy of approximately 98.5%, a precision of approximately 96.7%, a recall of approximately 96.4%, and an F-score of approximately 96.5%. Furthermore, this work lays a foundation for predicting the nature of attacks in future work. The IDS model will need to be enhanced by adding a filtering step that improves the prediction and supports classification into five classes, covering the normal status and the types of attacks.

Data availability
The datasets generated and/or analyzed during the current study are available in the Kaggle repository, https://www.kaggle.com/datasets/dhoogla/unswnb15.